Abstract

Most generalization bounds in learning theory are based on some measure of the complexity of the 
hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stabil- 
ity can be used to derive tight generalization bounds that are tailored to specific learning algorithms 
by exploiting their particular properties. However, as in much of learning theory, existing stability 
analyses and bounds apply only in the scenario where the samples are independently and identically 
distributed. In many machine learning applications, however, this assumption does not hold. The 
observations received by the learning algorithm often have some inherent temporal dependence. 

This paper studies the scenario where the observations are drawn from a stationary $\varphi$-mixing or
$\beta$-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a
dependence between observations weakening over time. We prove novel and distinct stability-based
generalization bounds for stationary $\varphi$-mixing and $\beta$-mixing sequences. These bounds strictly
generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby
extending the use of stability bounds to non-i.i.d. scenarios.

We also illustrate the application of our $\varphi$-mixing generalization bounds to general classes of
learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support 
Vector Machines, and many other kernel regularization-based and relative entropy-based regular- 
ization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use 
of these algorithms in non-i.i.d. scenarios. 

Keywords: Mixing Distributions, Algorithmic Stability, Generalization Bounds, Machine Learn- 
ing Theory 



1. Introduction 

Most generalization bounds in learning theory are based on some measure of the complexity of
the hypothesis class used, such as the VC-dimension, covering numbers, or Rademacher complexity.
These measures characterize a class of hypotheses, independently of any algorithm. In
contrast, the notion of algorithmic stability can be used to derive bounds that are tailored to specific
learning algorithms and exploit their particular properties. A learning algorithm is stable if
the hypothesis it outputs varies in a limited way in response to small changes made to the training
set. Algorithmic stability has been used effectively in the past to derive tight generalization bounds
(Bousquet and Elisseeff, 2001, 2002).

But, as in much of learning theory, existing stability analyses and bounds apply only in the 
scenario where the samples are independently and identically distributed (i.i.d.). In many machine 
learning applications, this assumption, however, does not hold; in fact, the i.i.d. assumption is not 
tested or derived from any data analysis. The observations received by the learning algorithm often 
have some inherent temporal dependence. This is clear in system diagnosis or time series prediction 
problems. Clearly, prices of different stocks on the same day, or of the same stock on different days, 
may be dependent. But, a less apparent time dependency may affect data sampled in many other 
tasks as well. 

This paper studies the scenario where the observations are drawn from a stationary $\varphi$-mixing or
$\beta$-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a
dependence between observations weakening over time (Yu, 1994; Meir, 2000; Vidyasagar, 2003;
Lozano et al., 2006). We prove novel and distinct stability-based generalization bounds for stationary
$\varphi$-mixing and $\beta$-mixing sequences. These bounds strictly generalize the bounds given in the
i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability
bounds to non-i.i.d. scenarios. Our proofs are based on the independent block technique described
by Yu (1994) and attributed to Bernstein (1927), which is commonly used in such contexts. However,
our analysis differs from previous uses of this technique in that the blocks of points considered
are not of equal size.

For our analysis of stationary $\varphi$-mixing sequences, we make use of a generalized version of
McDiarmid's inequality (Kontorovich and Ramanan, 2006) that holds for $\varphi$-mixing sequences. This
leads to stability-based generalization bounds with the standard exponential form. Our generalization
bounds for stationary $\beta$-mixing sequences cover a more general non-i.i.d. scenario and use
the standard McDiarmid's inequality. However, unlike the $\varphi$-mixing case, the $\beta$-mixing bound
presented here is not a purely exponential bound and contains an additive term depending on the mixing
coefficient.

We also illustrate the application of our $\varphi$-mixing generalization bounds to general classes of
learning algorithms, including Support Vector Regression (SVR) (Vapnik, 1998), Kernel Ridge
Regression (Saunders et al., 1998), and Support Vector Machines (SVMs) (Cortes and Vapnik, 1995).
Algorithms such as support vector regression (SVR) (Vapnik, 1998; Schölkopf and Smola, 2002)
have been used in the context of time series prediction in which the i.i.d. assumption does not
hold, some with good experimental results (Müller et al., 1997; Mattera and Haykin, 1999). To our
knowledge, the use of these algorithms in non-i.i.d. scenarios has not been previously supported
by any theoretical analysis. The stability bounds we give for SVR, SVMs, and many other kernel
regularization-based and relative entropy-based regularization algorithms can thus be viewed as the
first theoretical basis for their use in such scenarios.

The following sections are organized as follows. In Section 2, we introduce the necessary
definitions for the non-i.i.d. problems that we are considering and discuss the learning scenarios
in that context. Section 3 gives our main generalization bounds for stationary $\varphi$-mixing sequences
based on stability, and illustrates their application to general kernel regularization-based
algorithms, including SVR, KRR, and SVMs, as well as to relative entropy-based regularization
algorithms. Finally, Section 4 presents the first known stability bounds for the more general stationary
$\beta$-mixing scenario.



2. Preliminaries 



We first introduce some standard definitions for dependent observations in mixing theory (Doukhan,
1994) and then briefly discuss the learning scenarios in the non-i.i.d. case.



2.1 Non-i.i.d. Definitions 

Definition 1 A sequence of random variables $Z = \{Z_t\}_{t=-\infty}^{\infty}$ is said to be stationary if for any $t$
and non-negative integers $m$ and $k$, the random vectors $(Z_t, \ldots, Z_{t+m})$ and $(Z_{t+k}, \ldots, Z_{t+m+k})$
have the same distribution.



Thus, the index $t$, or time, does not affect the distribution of a variable $Z_t$ in a stationary sequence.
This does not imply independence, however. In particular, for $i < j < k$, $\Pr[Z_j \mid Z_i]$ may not
equal $\Pr[Z_k \mid Z_i]$. The following is a standard definition giving a measure of the dependence of the
random variables $Z_t$ within a stationary sequence. There are several equivalent definitions of this
quantity; we adopt here that of Yu (1994).



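Definition 2 Let $Z = \{Z_t\}_{t=-\infty}^{\infty}$ be a stationary sequence of random variables. For any $i, j \in
\mathbb{Z} \cup \{-\infty, +\infty\}$, let $\sigma_i^j$ denote the $\sigma$-algebra generated by the random variables $Z_k$, $i \leq k \leq j$.
Then, for any positive integer $k$, the $\beta$-mixing and $\varphi$-mixing coefficients of the stochastic process $Z$
are defined as

$$\beta(k) = \sup_{n} \mathbb{E}_{B \in \sigma_{-\infty}^{n}} \Big[ \sup_{A \in \sigma_{n+k}^{\infty}} \big| \Pr[A \mid B] - \Pr[A] \big| \Big], \qquad \varphi(k) = \sup_{\substack{n,\; A \in \sigma_{n+k}^{\infty} \\ B \in \sigma_{-\infty}^{n}}} \big| \Pr[A \mid B] - \Pr[A] \big|. \qquad (1)$$

$Z$ is said to be $\beta$-mixing ($\varphi$-mixing) if $\beta(k) \to 0$ (resp. $\varphi(k) \to 0$) as $k \to \infty$. It is said to be
algebraically $\beta$-mixing (algebraically $\varphi$-mixing) if there exist real numbers $\beta_0 > 0$ (resp. $\varphi_0 > 0$)
and $r > 0$ such that $\beta(k) \leq \beta_0 / k^r$ (resp. $\varphi(k) \leq \varphi_0 / k^r$) for all $k$, and exponentially mixing if there
exist real numbers $\beta_0$ (resp. $\varphi_0 > 0$) and $\beta_1$ (resp. $\varphi_1 > 0$) such that $\beta(k) \leq \beta_0 \exp(-\beta_1 k^r)$ (resp.
$\varphi(k) \leq \varphi_0 \exp(-\varphi_1 k^r)$) for all $k$.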



Both $\beta(k)$ and $\varphi(k)$ measure the dependence of an event on those that occurred more than $k$ units
of time in the past. $\beta$-mixing is a weaker assumption than $\varphi$-mixing and thus covers a more general
non-i.i.d. scenario.
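
As a purely illustrative aid, the two decay regimes named in Definition 2 can be written as small Python helpers. The function names are hypothetical and not part of the paper; `c0`, `c1`, and `r` stand for $\beta_0$ (or $\varphi_0$), $\beta_1$ (or $\varphi_1$), and $r$. The helpers return the upper envelopes from the definition, not the mixing coefficients of any particular process.

```python
import math


def algebraic_coefficient(k, c0, r):
    """Envelope c0 / k**r bounding an algebraically mixing coefficient."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return c0 / k ** r


def exponential_coefficient(k, c0, c1, r):
    """Envelope c0 * exp(-c1 * k**r) bounding an exponentially mixing coefficient."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return c0 * math.exp(-c1 * k ** r)
```

Both envelopes decrease in the gap `k`, which is the sense in which dependence weakens over time.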

This paper gives stability-based generalization bounds both in the $\varphi$-mixing and $\beta$-mixing case.
The $\beta$-mixing bounds cover a more general case; however, the $\varphi$-mixing bounds are
simpler and admit the standard exponential form. The $\varphi$-mixing bounds are based on a concentration
inequality that applies to $\varphi$-mixing processes only. Except for the use of this concentration bound,
all of the intermediate proofs and results used to derive a $\varphi$-mixing bound in Section 3 are given in the
more general case of $\beta$-mixing sequences.






MOHRI AND ROSTAMIZADEH 



It has been argued by Vidyasagar (2003) that $\beta$-mixing is "just the right" assumption for the
analysis of weakly-dependent sample points in machine learning, in particular because several
PAC-learning results then carry over to the non-i.i.d. case. Our $\beta$-mixing generalization bounds further
contribute to the analysis of this scenario.$^1$

We describe in several instances the application of our bounds in the case of algebraic mixing.
Algebraic mixing is a standard assumption for mixing coefficients that has been adopted in previous
studies of learning in the presence of dependent observations (Yu, 1994; Meir, 2000; Vidyasagar,
2003; Lozano et al., 2006).



Let us also point out that mixing assumptions can be checked in some cases such as with Gaussian
or Markov processes (Meir, 2000) and that mixing parameters can also be estimated in such
cases.

Most previous studies use a technique originally introduced by Bernstein (1927) based on
independent blocks of equal size (Yu, 1994; Meir, 2000; Lozano et al., 2006). This technique is
particularly relevant when dealing with stationary $\beta$-mixing. We will need a related but somewhat
different technique since the blocks we consider may not have the same size. The following lemma
is a special case of Corollary 2.7 from (Yu, 1994).



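Lemma 3 (Yu (1994), Corollary 2.7) Let $\mu \geq 1$ and suppose that $h$ is a measurable function,
with absolute value bounded by $M$, on a product probability space $\big(\prod_{j=1}^{\mu} \Omega_j, \prod_{i=1}^{\mu} \sigma_{r_i}^{s_i}\big)$, where
$r_i \leq s_i \leq r_{i+1}$ for all $i$. Let $Q$ be a probability measure on the product space with marginal
measures $Q_i$ on $(\Omega_i, \sigma_{r_i}^{s_i})$, and let $Q^{i+1}$ be the marginal measure of $Q$ on $\big(\prod_{j=1}^{i+1} \Omega_j, \prod_{j=1}^{i+1} \sigma_{r_j}^{s_j}\big)$,
$i = 1, \ldots, \mu - 1$. Let $\beta(Q) = \sup_{1 \leq i \leq \mu - 1} \beta(k_i)$, where $k_i = r_{i+1} - s_i$, and $P = \prod_{i=1}^{\mu} Q_i$. Then,

$$\big| \mathbb{E}_Q[h] - \mathbb{E}_P[h] \big| \leq (\mu - 1) M \beta(Q). \qquad (2)$$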

The lemma gives a measure of the difference between the distribution of $\mu$ blocks where the blocks
are independent in one case and dependent in the other case. The distribution within each block
is assumed to be the same in both cases. For a monotonically decreasing function $\beta$, we have
$\beta(Q) = \beta(k^*)$, where $k^* = \min_i(k_i)$ is the smallest gap between blocks.
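
To make the role of unequal blocks concrete, the following minimal Python sketch evaluates $\beta(Q)$ and the right-hand side of Eq. 2 for a list of blocks. It is an illustration only: the function name, the representation of blocks as `(r_i, s_i)` index pairs, and the requirement that every gap be at least 1 are assumptions made here, not part of the paper.

```python
def lemma3_gap_bound(blocks, beta, M):
    """Return (beta_Q, bound) for blocks given as (r_i, s_i) index pairs.

    beta_Q is the largest mixing coefficient over the gaps
    k_i = r_{i+1} - s_i between consecutive blocks, and
    bound = (mu - 1) * M * beta_Q is the right-hand side of Eq. 2.
    """
    mu = len(blocks)
    if mu < 2:
        # With a single block, dependent and independent versions coincide.
        return 0.0, 0.0
    gaps = [blocks[i + 1][0] - blocks[i][1] for i in range(mu - 1)]
    if any(g < 1 for g in gaps):
        raise ValueError("this sketch assumes a gap of at least 1 between blocks")
    beta_q = max(beta(g) for g in gaps)
    return beta_q, (mu - 1) * M * beta_q
```

For a decreasing `beta`, the maximum is attained at the smallest gap, matching $\beta(Q) = \beta(k^*)$.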

2.2 Learning Scenarios 

We consider the familiar supervised learning setting where the learning algorithm receives a sample
of $m$ labeled points $S = (z_1, \ldots, z_m) = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times Y)^m$, where $X$ is the
input space and $Y$ the set of labels ($Y = \mathbb{R}$ in the regression case), both assumed to be measurable.

For a fixed learning algorithm, we denote by $h_S$ the hypothesis it returns when trained on the
sample $S$. The error of a hypothesis on a pair $z \in X \times Y$ is measured in terms of a cost function $c :
Y \times Y \to \mathbb{R}_+$. Thus, $c(h(x), y)$ measures the error of a hypothesis $h$ on a pair $(x, y)$, with $c(h(x), y) =
(h(x) - y)^2$ in the standard regression cases. We will use the shorthand $c(h, z) := c(h(x), y)$ for a
hypothesis $h$ and $z = (x, y) \in X \times Y$ and will assume that $c$ is upper bounded by a constant $M > 0$.



1. Some results have also been obtained in the more general context of $\alpha$-mixing but they seem to require the stronger
condition of exponential mixing (Modha and Masry, 1998).

We denote by $\widehat{R}(h)$ the empirical error of a hypothesis $h$ for a training sample $S = (z_1, \ldots, z_m)$:

$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} c(h, z_i). \qquad (3)$$

In the standard machine learning scenario, the sample pairs $z_1, \ldots, z_m$ are assumed to be i.i.d., a
restrictive assumption that does not always hold in practice. We will consider here the more general
case of dependent samples drawn from a stationary mixing sequence $Z$ over $X \times Y$. As in the i.i.d.
case, the objective of the learning algorithm is to select a hypothesis with small error over future
samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample $S$ and thus the generalization
error or true error of the hypothesis $h_S$ trained on $S$ must be measured by its expected
error conditioned on the sample $S$:

$$R(h_S) = \mathbb{E}_{z}[c(h_S, z) \mid S]. \qquad (4)$$

This is the most realistic setting in this context, which matches time series prediction problems. 
A somewhat less realistic version is one where the samples are dependent, but the test points are 
assumed to be independent of the training sample S. The generalization error of the hypothesis hs 
trained on S is then: 

$$R(h_S) = \mathbb{E}_{z}[c(h_S, z) \mid S] = \mathbb{E}_{z}[c(h_S, z)]. \qquad (5)$$

This setting seems less natural since, if samples are dependent, future test points must also depend
on the training points, even if that dependence is relatively weak due to the time interval after which
test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by
all previous machine learning studies that we are aware of (Yu, 1994; Meir, 2000; Vidyasagar, 2003;
Lozano et al., 2006), even when examining specifically a time series prediction problem (Meir,
2000). Thus, the bounds derived in these studies cannot be directly applied to the more general
setting.

We will consider instead the most general setting with the definition of the generalization error
based on Eq. 4. Clearly, our analysis also applies to the less general setting just discussed.

Let us briefly discuss the more general scenario of non-stationary mixing sequences, that is one
where the distribution may change over time. Within that general case, the generalization error of a
hypothesis $h_S$, defined straightforwardly by

$$R(h_S, t) = \mathbb{E}_{z_t \sim \sigma_t}[c(h_S, z_t) \mid S], \qquad (6)$$

would depend on the time $t$ and it may be the case that $R(h_S, t) \neq R(h_S, t')$ for $t \neq t'$, making the
definition of the generalization error a more subtle issue. To remove the dependence on time, one
could define a weaker notion of the generalization error based on an expected loss over all time:

$$R(h_S) = \mathbb{E}_{t}[R(h_S, t)]. \qquad (7)$$

It is not clear, however, whether this term could be easily computed and useful. A stronger condition
would be to minimize the generalization error for any particular target time. Studies of this type
have been conducted for smoothly changing distributions, such as in Zhou et al. (2008); however,
to the best of our knowledge, the scenario of sequences that are both non-identical and non-independent
has not yet been studied.

3. $\varphi$-Mixing Generalization Bounds and Applications

This section gives generalization bounds for $\hat\beta$-stable algorithms over a mixing stationary
distribution.$^2$ The first two sections present our main proofs, which hold for $\beta$-mixing stationary
distributions. In the third section, we will briefly discuss concentration inequalities that apply to $\varphi$-mixing
processes only. Then, in the final section, we will present our main results.



The condition of $\hat\beta$-stability is an algorithm-dependent property first introduced by Devroye and Wagner
(1979) and Kearns and Ron (1997). It has been later used successfully by Bousquet and Elisseeff
(2001, 2002) to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a
learning algorithm is said to be stable if small changes to the training set do not produce large
deviations in its output. The following gives the precise technical definition.

Definition 4 A learning algorithm is said to be (uniformly) $\hat\beta$-stable if the hypotheses it returns for
any two training samples $S$ and $S'$ that differ by a single point satisfy

$$\forall z \in X \times Y, \quad |c(h_S, z) - c(h_{S'}, z)| \leq \hat\beta. \qquad (8)$$
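
The definition can be probed numerically on a toy learner. The sketch below is an assumption-laden illustration, not a method from the paper: it uses a hypothetical one-dimensional ridge regressor without intercept, the squared cost, and a finite set of probe points, so the value it returns is only a lower bound on the supremum in Eq. 8 for one particular pair $(S, S')$.

```python
def fit_ridge_1d(sample, lam):
    """Minimize (1/m) * sum((w * x - y)**2) + lam * w**2 over a scalar w."""
    m = len(sample)
    sxy = sum(x * y for x, y in sample)
    sxx = sum(x * x for x, y in sample)
    return sxy / (sxx + lam * m)


def cost(w, z):
    """Squared cost of the hypothesis x -> w * x on the pair z = (x, y)."""
    x, y = z
    return (w * x - y) ** 2


def stability_gap(sample, i, replacement, lam, probes):
    """Largest change in cost over the probe points when point i is replaced."""
    w = fit_ridge_1d(sample, lam)
    perturbed = list(sample)
    perturbed[i] = replacement
    w_prime = fit_ridge_1d(perturbed, lam)
    return max(abs(cost(w, z) - cost(w_prime, z)) for z in probes)


def make_sample(m):
    """Deterministic toy sample with y = 0.5 * x and x cycling over 0.1, ..., 1.0."""
    return [((j % 10 + 1) / 10.0, 0.5 * (j % 10 + 1) / 10.0) for j in range(m)]
```

On this toy data, replacing one point changes the cost less when the sample is larger, which is the qualitative behavior expected of a stable regularized algorithm.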



The use of stability in conjunction with McDiarmid's inequality will allow us to produce generalization
bounds. McDiarmid's inequality is an exponential concentration bound of the type

$$\Pr\big[ |\Phi - \mathbb{E}[\Phi]| \geq \epsilon \big] \leq \exp\Big( -\frac{\epsilon^2}{m l^2} \Big),$$

where the probability is over a sample of size $m$ and $l$ is the Lipschitz parameter of $\Phi$ (which is also
a function of $m$). Unfortunately, this inequality cannot be easily applied when the sample points are
not distributed in an i.i.d. fashion. We will use the results of Kontorovich and Ramanan (2006) to
extend the use of McDiarmid's inequality to general mixing distributions (Theorem 9).

To obtain a stability-based generalization bound, we will apply this theorem to $\Phi(S) = R(h_S) -
\widehat{R}(h_S)$. To do so, we need to show, as with the standard McDiarmid's inequality, that $\Phi$ is a Lipschitz
function and, to make it useful, bound $\mathbb{E}[\Phi]$. The next two sections describe how we achieve both
of these in this non-i.i.d. scenario.

Let us first take a brief look at the problem faced when attempting to give stability bounds for
dependent sequences and give some idea of our solution for that problem. The stability proofs given
by Bousquet and Elisseeff (2001) assume the i.i.d. property, thus replacing an element in a sequence
with another does not affect the expected value of a random variable defined over that sequence. In
other words, the following equality holds:

$$\mathbb{E}[V(Z_1, \ldots, Z_i, \ldots, Z_m)] = \mathbb{E}[V(Z_1, \ldots, Z_i', \ldots, Z_m)], \qquad (9)$$

for a random variable $V$ that is a function of the sequence of random variables $S = (Z_1, \ldots, Z_m)$.
However, clearly, if the points in that sequence $S$ are dependent, this equality may not hold anymore.

The main technique to cope with this problem is based on the so-called "independent block
sequence" originally introduced by Bernstein (1927). This consists of eliminating from the original
dependent sequence several blocks of contiguous points, leaving us with some remaining blocks of
points. Instead of these dependent blocks, we then consider independent blocks of points, each with
the same size and the same distribution (within each block) as the dependent ones. By Lemma 3, for
a $\beta$-mixing distribution, the expected value of a random variable defined over the dependent blocks
is close to the one based on these independent blocks. Working with these independent blocks
brings us back to a situation similar to the i.i.d. case, with i.i.d. blocks replacing i.i.d. points.

2. The standard variable used for the stability coefficient is $\beta$. To avoid the confusion with the $\beta$-mixing coefficient, we
will use $\hat\beta$ instead.

Our use of this method somewhat differs from previous ones (see Yu, 1994; Meir, 2000) where
many blocks of equal size are considered. We will be dealing with four blocks and with typically
unequal sizes. More specifically, note that for Equation 9 to hold, we only need that the variable
$Z_i$ be independent of the other points in the sequence. To achieve this, roughly speaking, we will
be "discarding" some of the points in the sequence surrounding $Z_i$. This results in a sequence
of three blocks of contiguous points. If our algorithm is stable and we do not discard too many
points, the hypothesis returned should not be greatly affected by this operation. In the next step,
we apply the independent block lemma, which then allows us to treat each of these blocks as
independent modulo the addition of a mixing term. In particular, $Z_i$ becomes independent of all
other points. Clearly, the number of points discarded is subject to a trade-off: removing too many
points could excessively modify the hypothesis returned; removing too few would maintain the
dependency between $Z_i$ and the remaining points, thereby producing a larger penalty when applying
Lemma 3. This trade-off is made explicit in the following section where an optimal solution is
sought.



3.1 Lipschitz Bound 

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample.
We first present a lemma that relates the expected value of the generalization error in that scenario
and the same expectation in the scenario where the test point is independent of the training sample.
We denote by $R(h_S) = \mathbb{E}_z[c(h_S, z) \mid S]$ the expectation in the dependent case and by $\widetilde{R}(h_{S_b}) =
\mathbb{E}_{\widetilde z}[c(h_{S_b}, \widetilde z)]$ the expectation where the test points are assumed independent of the training, with
$S_b$ denoting a sequence similar to $S$ but with the last $b$ points removed. Figure 1(a) illustrates that
sequence. The block $S_b$ is assumed to have exactly the same distribution as the corresponding block
of the same size in $S$.

Lemma 5 Assume that the learning algorithm is $\hat\beta$-stable and that the cost function $c$ is bounded
by $M$. Then, for any sample $S$ of size $m$ drawn from a $\beta$-mixing stationary distribution and for any
$b \in \{0, \ldots, m\}$, the following holds:

$$\big| \mathbb{E}_S[R(h_S)] - \mathbb{E}_{S_b}[\widetilde{R}(h_{S_b})] \big| \leq b\hat\beta + \beta(b)M. \qquad (10)$$

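Proof The $\hat\beta$-stability of the learning algorithm implies that

$$\mathbb{E}_S[R(h_S)] = \mathbb{E}_{S,z}[c(h_S, z)] \leq \mathbb{E}_{S_b,z}[c(h_{S_b}, z)] + b\hat\beta. \qquad (11)$$

The application of Lemma 3 yields

$$\mathbb{E}_S[R(h_S)] \leq \mathbb{E}_{S_b,\widetilde z}[c(h_{S_b}, \widetilde z)] + b\hat\beta + \beta(b)M = \mathbb{E}_{S_b}[\widetilde{R}(h_{S_b})] + b\hat\beta + \beta(b)M. \qquad (12)$$

The other side of the inequality of the lemma can be shown following the same steps. ■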
We can now prove a Lipschitz bound for the function $\Phi$.

[Figure 1: panels (a) through (d); panel (a) shows $S_b$ and panel (c) shows $S_{i,b}$.]

Figure 1: Illustration of the sequences derived from S that are considered in the proofs.

Lemma 6 Let $S = (z_1, \ldots, z_i, \ldots, z_m)$ and $S^i = (z_1, \ldots, z_i', \ldots, z_m)$ be two sequences drawn
from a $\beta$-mixing stationary process that differ only in point $i \in [1, m]$, and let $h_S$ and $h_{S^i}$ be the
hypotheses returned by a $\hat\beta$-stable algorithm when trained on each of these samples. Then, for any
$i \in [1, m]$, the following inequality holds:

$$|\Phi(S) - \Phi(S^i)| \leq (b + 1) 2\hat\beta + 2\beta(b)M + \frac{M}{m}. \qquad (13)$$



Proof To prove this inequality, we first bound the difference of the empirical errors as in (Bousquet and Elisseeff,
2002), then the difference of the true errors. Bounding the difference of costs on agreeing points
with $\hat\beta$ and the one that disagrees with $M$ yields

$$|\widehat{R}(h_S) - \widehat{R}(h_{S^i})| \leq \frac{1}{m} \sum_{j \neq i} |c(h_S, z_j) - c(h_{S^i}, z_j)| + \frac{1}{m} |c(h_S, z_i) - c(h_{S^i}, z_i')| \leq \hat\beta + \frac{M}{m}. \qquad (14)$$

Since both $R(h_S)$ and $R(h_{S^i})$ are defined with respect to a (different) dependent point, we apply
Lemma 5 to both generalization error terms and use $\hat\beta$-stability. This then results in

$$|R(h_S) - R(h_{S^i})| \leq |\widetilde{R}(h_{S_b}) - \widetilde{R}(h_{S_b^i})| + 2b\hat\beta + 2\beta(b)M \qquad (15)$$
$$= \big| \mathbb{E}_{\widetilde z}[c(h_{S_b}, \widetilde z) - c(h_{S_b^i}, \widetilde z)] \big| + 2b\hat\beta + 2\beta(b)M \leq \hat\beta + 2b\hat\beta + 2\beta(b)M.$$

The lemma's statement is obtained by combining inequalities 14 and 15.



3.2 Bound on Expectation 

As mentioned earlier, to obtain an explicit bound after application of a generalized McDiarmid's
inequality, we also need to bound $\mathbb{E}_S[\Phi(S)]$. This is done by analyzing independent blocks using
Lemma 3.






Stability Bounds for Non-i.i.d. Processes 



Lemma 7 Let $h_S$ be the hypothesis returned by a $\hat\beta$-stable algorithm trained on a sample $S$ drawn
from a stationary $\beta$-mixing distribution. Then, for all $b \in [1, m]$, the following inequality holds:

$$\mathbb{E}_S[|\Phi(S)|] \leq (6b + 1)\hat\beta + 3\beta(b)M. \qquad (16)$$

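Proof Let $S_b$ be defined as in the proof of Lemma 5. To deal with independent block sequences
defined with respect to the same hypothesis, we will consider the sequence $S_{i,b} = S_i \cap S_b$, which
is illustrated by Figure 1(c). This can result in as many as four blocks. As before, we will consider
a sequence $\widetilde{S}_{i,b}$ with a similar set of blocks each with the same distribution as the corresponding
blocks in $S_{i,b}$, but such that the blocks are independent.

Since three blocks of at most $b$ points are removed from each hypothesis, by the $\hat\beta$-stability of
the learning algorithm, the following holds:

$$\mathbb{E}_S[\widehat{R}(h_S) - R(h_S)] = \mathbb{E}_{S,z}\Big[ \frac{1}{m} \sum_{i=1}^{m} c(h_S, z_i) - c(h_S, z) \Big] \qquad (17)$$
$$\leq \mathbb{E}_{S_{i,b},z}\Big[ \frac{1}{m} \sum_{i=1}^{m} c(h_{S_{i,b}}, z_i) - c(h_{S_{i,b}}, z) \Big] + 6b\hat\beta. \qquad (18)$$

The application of Lemma 3 to the difference of two cost functions also bounded by $M$ as in the
right-hand side leads to

$$\mathbb{E}_S[\widehat{R}(h_S) - R(h_S)] \leq \mathbb{E}_{\widetilde{S}_{i,b},\widetilde z}\Big[ \frac{1}{m} \sum_{i=1}^{m} c(h_{\widetilde{S}_{i,b}}, \widetilde z_i) - c(h_{\widetilde{S}_{i,b}}, \widetilde z) \Big] + 6b\hat\beta + 3\beta(b)M. \qquad (19)$$

Now, since the points $\widetilde z$ and $\widetilde z_i$ are independent and since the distribution is stationary, they have the
same distribution and we can replace $\widetilde z_i$ with $\widetilde z$ in the empirical cost. Thus, we can write

$$\mathbb{E}_S[\widehat{R}(h_S) - R(h_S)] \leq \mathbb{E}_{\widetilde{S}_{i,b},\widetilde z}\Big[ \frac{1}{m} \sum_{i=1}^{m} c(h_{\widetilde{S}_{i,b}^{i}}, \widetilde z) - c(h_{\widetilde{S}_{i,b}}, \widetilde z) \Big] + 6b\hat\beta + 3\beta(b)M \leq \hat\beta + 6b\hat\beta + 3\beta(b)M,$$

where $\widetilde{S}_{i,b}^{i}$ is the sequence derived from $\widetilde{S}_{i,b}$ by replacing $\widetilde z_i$ with $\widetilde z$. The last inequality holds by
$\hat\beta$-stability of the learning algorithm. The other side of the inequality in the statement of the lemma
can be shown following the same steps. ■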



3.3 $\varphi$-mixing Generalization Bounds



We are now prepared to make use of a concentration inequality to provide a generalization bound
in the $\varphi$-mixing scenario. Several concentration inequalities have been shown in the $\varphi$-mixing case,
e.g. Marton (1998); Samson (2000); Chazottes et al. (2007); Kontorovich and Ramanan (2006). We
will use that of Kontorovich and Ramanan (2006), which is very similar to that of Chazottes et al.
(2007) modulo the fact that the latter requires a finite sample space.

These concentration inequalities are generalizations of the following inequality of McDiarmid
(1989) commonly used in the i.i.d. setting.

Theorem 8 (McDiarmid (1989), 6.10) Let $S = (Z_1, \ldots, Z_m)$ be a sequence of random variables,
each taking values in the set $Z$. Then, for any measurable function $\Phi : Z^m \to \mathbb{R}$ that satisfies the
following for all $i \in \{1, \ldots, m\}$ and all $z_i, z_i' \in Z$,

$$\Big| \mathbb{E}_S\big[\Phi(S) \mid Z_1 = z_1, \ldots, Z_i = z_i\big] - \mathbb{E}_S\big[\Phi(S) \mid Z_1 = z_1, \ldots, Z_i = z_i'\big] \Big| \leq c_i,$$

for constants $c_i$, the following holds for all $\epsilon > 0$:

$$\Pr\big[ |\Phi - \mathbb{E}[\Phi]| \geq \epsilon \big] \leq 2\exp\Big( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \Big).$$

In the i.i.d. scenario, the requirement to produce the constants $c_i$ simply translates into a Lipschitz
condition on the function $\Phi$. Theorem 5.1 of Kontorovich and Ramanan (2006) bounds
precisely this quantity as follows:$^3$

$$c_i \leq 1 + 2 \sum_{k=1}^{m-i} \varphi(k). \qquad (20)$$

Given the bound in Equation 20, the concentration bound of McDiarmid can be restated as
follows, making it easily accessible to $\varphi$-mixing distributions.



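Theorem 9 (Kontorovich and Ramanan (2006)) Let $\Phi : Z^m \to \mathbb{R}$ be a measurable function. If
$\Phi$ is $l$-Lipschitz with respect to the Hamming metric for some $l > 0$, then the following holds for all
$\epsilon > 0$:

$$\Pr_Z\big[ |\Phi(Z) - \mathbb{E}[\Phi(Z)]| > \epsilon \big] \leq 2\exp\Big( \frac{-2\epsilon^2}{m l^2 \|\Delta_m\|_\infty^2} \Big), \qquad (21)$$

where $\|\Delta_m\|_\infty \leq 1 + 2 \sum_{k=1}^{m} \varphi(k)$.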



It should be pointed out that the statement of the theorem in this paper is improved by a factor of 4
in the exponent, from the one stated in Kontorovich and Ramanan (2006), Theorem 1.1. This can be
achieved straightforwardly by following the same steps as in the proof by Kontorovich and Ramanan
(2006) and making use of the general form of McDiarmid's inequality (Theorem 8) as opposed to
Azuma's inequality.

This section presents several theorems that constitute the main results of this paper. The following
theorem is constructed from the bounds shown in the previous three sections.

Theorem 10 (General Non-i.i.d. Stability Bound) Let $h_S$ denote the hypothesis returned by a
$\hat\beta$-stable algorithm trained on a sample $S$ drawn from a $\varphi$-mixing stationary distribution and let $c$ be
a measurable non-negative cost function upper bounded by $M > 0$. Then, for any $b \in [0, m]$ and any
$\epsilon > 0$, the following generalization bound holds:

$$\Pr_S\Big[ \big| R(h_S) - \widehat{R}(h_S) \big| > \epsilon + (6b + 1)\hat\beta + 6M\varphi(b) \Big] \leq 2\exp\Bigg( \frac{-2\epsilon^2 \big(1 + 2\sum_{i=1}^{m} \varphi(i)\big)^{-2}}{m \big((b + 1) 2\hat\beta + 2M\varphi(b) + M/m\big)^2} \Bigg).$$



3. We should note that the original bound is expressed in terms of $\eta$-mixing coefficients. To simplify presentation, we
are adapting it to the case of stationary $\varphi$-mixing sequences by using the following straightforward inequality for a
stationary process: $2\varphi(j - i) \geq \eta_{ij}$. Furthermore, the bound presented in Kontorovich and Ramanan (2006) holds
when the sample space is countable; it is extended to the continuous case in Kontorovich (2007).

Proof The theorem follows directly from the application of Lemma 6 and Lemma 7 to Theorem 9. ■

The theorem gives a general stability bound for $\varphi$-mixing stationary sequences. If we further
assume that the sequence is algebraically $\varphi$-mixing, that is for all $k$, $\varphi(k) = \varphi_0 k^{-r}$ for some $r > 1$,
then we can solve for the value of $b$ to optimize the bound.
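
Before optimizing over $b$, it can help to see how the two sides of Theorem 10 depend on it. The following Python sketch simply evaluates the deviation threshold and the probability bound as stated in the theorem. The function name and argument layout are hypothetical, and the caller is assumed to supply a function `phi` that is defined at the chosen `b` (including `b = 0` if that value is used).

```python
import math


def theorem10_bound(m, b, beta_hat, M, phi, eps):
    """Evaluate both sides of the Theorem 10 bound.

    Returns (deviation, prob) where
      deviation = eps + (6b + 1) * beta_hat + 6 * M * phi(b)
    is the threshold on |R(h_S) - empirical error| and prob is the stated
    upper bound on the probability of exceeding it (it may be above 1,
    in which case the bound is vacuous).
    """
    phi_b = phi(b)
    deviation = eps + (6 * b + 1) * beta_hat + 6 * M * phi_b
    lipschitz = (b + 1) * 2 * beta_hat + 2 * M * phi_b + M / m
    delta = 1 + 2 * sum(phi(i) for i in range(1, m + 1))
    prob = 2 * math.exp(-2 * eps ** 2 / (m * (lipschitz * delta) ** 2))
    return deviation, prob
```

With a zero mixing coefficient and `b = 0`, the mixing factor equals 1 and the expression depends only on `beta_hat`, `M`, `m`, and `eps`.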



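Theorem 11 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences) Let $h_S$ denote the
hypothesis returned by a $\hat\beta$-stable algorithm trained on a sample $S$ drawn from an algebraically
$\varphi$-mixing stationary distribution, $\varphi(k) = \varphi_0 k^{-r}$ with $r > 1$, and let $c$ be a measurable non-negative
cost function upper bounded by $M > 0$. Then, for any $\epsilon > 0$, the following generalization bound
holds:

$$\Pr_S\Big[ \big| R(h_S) - \widehat{R}(h_S) \big| > \epsilon + \hat\beta + (r + 1) 6M\varphi(b) \Big] \leq 2\exp\Bigg( \frac{-2\epsilon^2 \big(1 + 2\varphi_0 r/(r - 1)\big)^{-2}}{m \big(2\hat\beta + (r + 1) 2M\varphi(b) + M/m\big)^2} \Bigg),$$

where $\varphi(b) = \varphi_0 \Big( \dfrac{\hat\beta}{r \varphi_0 M} \Big)^{r/(r+1)}$.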



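Proof For an algebraically mixing sequence, the value of $b$ minimizing the bound of Theorem 10
satisfies $\hat\beta b = rM\varphi(b)$, which gives $b = \Big( \dfrac{\hat\beta}{r \varphi_0 M} \Big)^{-1/(r+1)}$ and $\varphi(b) = \varphi_0 \Big( \dfrac{\hat\beta}{r \varphi_0 M} \Big)^{r/(r+1)}$. The
following term can be bounded as

$$1 + 2\sum_{i=1}^{m} \varphi(i) = 1 + 2\sum_{i=1}^{m} \varphi_0 i^{-r} \leq 1 + 2\varphi_0 \Big( 1 + \int_1^m i^{-r} \, di \Big) = 1 + 2\varphi_0 \Big( 1 + \frac{m^{1-r} - 1}{1 - r} \Big).$$

Using the assumption $r > 1$, we have $0 < m^{1-r} \leq 1$, so that $\frac{m^{1-r} - 1}{1 - r} = \frac{1 - m^{1-r}}{r - 1} \leq \frac{1}{r - 1}$, and find that

$$1 + 2\varphi_0 \Big( 1 + \frac{m^{1-r} - 1}{1 - r} \Big) \leq 1 + 2\varphi_0 \Big( 1 + \frac{1}{r - 1} \Big) = 1 + \frac{2\varphi_0 r}{r - 1}.$$

Plugging in this value and the minimizing value of $b$ in the bound of Theorem 10 yields the statement
of the theorem. ■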
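
The two closed forms used in this proof are easy to check numerically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, `b` is treated as a real number exactly as in the proof (rounding it to an integer block size would be an additional implementation choice), and the parameter values used to exercise it are arbitrary.

```python
def optimal_block_size(beta_hat, M, phi0, r):
    """Real-valued b solving beta_hat * b = r * M * phi0 * b**(-r)."""
    return (beta_hat / (r * phi0 * M)) ** (-1.0 / (r + 1))


def phi_at_optimum(beta_hat, M, phi0, r):
    """Value of phi(b) = phi0 * b**(-r) at the optimal b."""
    return phi0 * (beta_hat / (r * phi0 * M)) ** (r / (r + 1))


def mixing_sum_bound(phi0, r):
    """Upper bound 1 + 2 * phi0 * r / (r - 1) on 1 + 2 * sum_i phi0 * i**(-r)."""
    if r <= 1:
        raise ValueError("the bound requires r > 1")
    return 1 + 2 * phi0 * r / (r - 1)
```

For example, with `beta_hat = 0.001`, `M = 1`, `phi0 = 1`, and `r = 2`, the optimal `b` is the cube root of 2000, and the first-order condition $\hat\beta b = rM\varphi(b)$ can be verified directly from the two helper values.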

In the case of a zero mixing coefficient ($\varphi = 0$ and $b = 0$), the bounds of Theorem 10 coincide
with the i.i.d. stability bound of (Bousquet and Elisseeff, 2002). In order for the right-hand
side of these bounds to converge, we must have $\hat\beta = o(1/\sqrt{m})$ and $\varphi(b) = o(1/\sqrt{m})$. For several
general classes of algorithms, $\hat\beta \leq O(1/m)$ (Bousquet and Elisseeff, 2002). In the case of
algebraically mixing sequences with $r > 1$, as assumed in Theorem 11, $\hat\beta \leq O(1/m)$ implies
$\varphi(b) = \varphi_0 (\hat\beta / (r \varphi_0 M))^{r/(r+1)} < O(1/\sqrt{m})$. The next section illustrates the application of
Theorem 11 to several general classes of algorithms.

We now present the application of our stability bounds to several algorithms in the case of an
algebraically mixing sequence. We make use of the stability analysis found in Bousquet and Elisseeff
(2002), which allows us to apply our bounds in the case of kernel regularized algorithms, $k$-local
rules and relative entropy regularization.

3.4 Applications

3.4.1 Kernel Regularized Algorithms 

Here we apply our bounds to a family of algorithms based on the minimization of a regularized
objective function based on the norm $\|\cdot\|_K$ in a reproducing kernel Hilbert space, where $K$ is a
positive definite symmetric kernel:

$$\operatorname*{argmin}_{h \in H} \frac{1}{m} \sum_{i=1}^{m} c(h, z_i) + \lambda \|h\|_K^2. \qquad (22)$$

The application of our bound is possible , under some general conditions, since kernel regularized 
algorithms are stable with (3 < 0{l/m) (|Bousquet and Elissee ff . 2002). Here we briefly reproduce 
the proof of this /3-stability for the sake of completeness; first we introduce some needed terminol- 
ogy- 

We will assume that the cost function $c$ is $\sigma$-admissible, that is there exists $\sigma \in \mathbb{R}_+$ such that for any two hypotheses $h, h' \in H$ and for all $z = (x, y) \in X \times Y$,

$$|c(h, z) - c(h', z)| \le \sigma |h(x) - h'(x)|. \tag{23}$$

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some $M \in \mathbb{R}_+$: $\forall h \in H, \forall x \in X, |h(x)| \le M$ and $\forall y \in Y, |y| \le M$. We will also assume that $c$ is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.

We denote by $B_F$ the Bregman divergence associated to a convex function $F$: $B_F(f\|g) = F(f) - F(g) - \langle f - g, \nabla F(g) \rangle$. In what follows, it will be helpful to define $F$ as the objective function of a general regularization based algorithm,

$$F_S(h) = \widehat{R}_S(h) + \lambda N(h), \tag{24}$$

where $\widehat{R}_S$ is the empirical error as measured on the sample $S$, $N : H \to \mathbb{R}_+$ is a regularization function and $\lambda > 0$ is the usual trade-off parameter. Finally, we shall use the shorthand $\Delta h = h' - h$.
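As a worked instance of this definition, consider the squared RKHS norm $N(h) = \|h\|_K^2$, which is the regularizer used for kernel regularized algorithms. The short derivation below is a sketch that relies only on the identity $\nabla N(g) = 2g$; it shows why the Bregman divergence is symmetric in this case.

```latex
\begin{align*}
B_N(f \| g) &= \|f\|_K^2 - \|g\|_K^2 - \langle f - g, 2g \rangle \\
            &= \|f\|_K^2 - 2\langle f, g \rangle + \|g\|_K^2 \\
            &= \|f - g\|_K^2, \\
B_N(h' \| h) + B_N(h \| h') &= 2\|h' - h\|_K^2 = 2\|\Delta h\|_K^2.
\end{align*}
```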

Lemma 12 (Bousquet and Elisseeff (2002)) A kernel regularized learning algorithm, (22), with bounded kernel $K(x, x) \le \kappa^2 < \infty$ and $\sigma$-admissible cost function, is $\hat{\beta}$-stable with coefficient,

$$\hat{\beta} \le \frac{\sigma^2 \kappa^2}{m\lambda}.$$



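Proof Let $h$ and $h'$ be the minimizers of $F_S$ and $F_{S'}$ respectively, where $S$ and $S'$ differ in the first coordinate (choice of coordinate is without loss of generality), then,

$$B_N(h'\|h) + B_N(h\|h') \le \frac{2\sigma}{m\lambda} \sup_{x \in S} |\Delta h(x)|. \tag{25}$$

To see this, we notice that since $B_F = B_{\widehat{R}} + \lambda B_N$, and since a Bregman divergence is non-negative,

$$\lambda \left( B_N(h'\|h) + B_N(h\|h') \right) \le B_{F_S}(h'\|h) + B_{F_{S'}}(h\|h').$$

By the definition of $h$ and $h'$ as the minimizers of $F_S$ and $F_{S'}$,

$$B_{F_S}(h'\|h) + B_{F_{S'}}(h\|h') = \widehat{R}_{F_S}(h') - \widehat{R}_{F_S}(h) + \widehat{R}_{F_{S'}}(h) - \widehat{R}_{F_{S'}}(h'),$$

where $\widehat{R}_{F_S}$ denotes the empirical error term of $F_S$ (the gradient terms vanish at the minimizers and the regularization terms cancel). Finally, by the $\sigma$-admissibility of the cost function $c$ and the definition of $S$ and $S'$,

$$\begin{aligned}
\lambda \left( B_N(h'\|h) + B_N(h\|h') \right) &\le \widehat{R}_{F_S}(h') - \widehat{R}_{F_S}(h) + \widehat{R}_{F_{S'}}(h) - \widehat{R}_{F_{S'}}(h') \\
&= \frac{1}{m} \left[ c(h', z_1) - c(h, z_1) + c(h, z_1') - c(h', z_1') \right] \\
&\le \frac{1}{m} \left[ \sigma |\Delta h(x_1)| + \sigma |\Delta h(x_1')| \right] \\
&\le \frac{2\sigma}{m} \sup_{x \in S} |\Delta h(x)|,
\end{aligned}$$

which establishes (25).

Now, if we consider $N(\cdot) = \|\cdot\|_K^2$, we have $B_N(h'\|h) = \|h' - h\|_K^2$, thus $B_N(h'\|h) + B_N(h\|h') = 2\|\Delta h\|_K^2$ and by (25) and the reproducing kernel property,

$$2\|\Delta h\|_K^2 \le \frac{2\sigma}{m\lambda} \sup_{x \in S} |\Delta h(x)| \le \frac{2\sigma}{m\lambda} \kappa \|\Delta h\|_K.$$

Thus $\|\Delta h\|_K \le \frac{\sigma\kappa}{m\lambda}$. And using the $\sigma$-admissibility of $c$ and the kernel reproducing property we get,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \le \sigma |\Delta h(x)| \le \kappa\sigma \|\Delta h\|_K.$$

Therefore,

$$\forall z \in X \times Y, \quad |c(h', z) - c(h, z)| \le \frac{\sigma^2 \kappa^2}{m\lambda},$$

which completes the proof. ■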



Three specific instances of kernel regularization algorithms are SVR, for which the cost function is based on the $\epsilon$-insensitive cost:

$$c(h, z) = |h(x) - y|_\epsilon = \begin{cases} 0 & \text{if } |h(x) - y| \le \epsilon, \\ |h(x) - y| - \epsilon & \text{otherwise,} \end{cases} \tag{26}$$

Kernel Ridge Regression (Saunders et al., 1998), for which

$$c(h, z) = (h(x) - y)^2, \tag{27}$$

and finally Support Vector Machines with the hinge-loss,

$$c(h, z) = \begin{cases} 0 & \text{if } 1 - yh(x) \le 0, \\ 1 - yh(x) & \text{if } 0 \le yh(x) < 1, \\ 1 & \text{if } yh(x) < 0. \end{cases} \tag{28}$$
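The three cost functions can be written directly as code. The sketch below assumes the standard $\epsilon$-insensitive cost for (26), the squared loss for (27), and the hinge loss clipped at 1 for (28), with labels in $\{-1, +1\}$ for the SVM case. The function names are illustrative.

```python
def svr_loss(h_x, y, eps):
    """Epsilon-insensitive cost: zero inside the tube, linear outside."""
    return max(0.0, abs(h_x - y) - eps)


def krr_loss(h_x, y):
    """Squared loss used by Kernel Ridge Regression."""
    return (h_x - y) ** 2


def svm_loss(h_x, y):
    """Hinge loss clipped at 1, so that the cost is bounded by M = 1."""
    margin = y * h_x
    if margin >= 1:
        return 0.0           # correct with margin at least 1
    if margin >= 0:
        return 1.0 - margin  # correct but inside the margin
    return 1.0               # misclassified: cost is capped at 1
```

The clipping in `svm_loss` is what makes the SVM cost bounded, which the stability bounds require.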



We note that for kernel regularization algorithms, as pointed out in Bousquet and Elisseeff (2002, Lemma 23), a bound on the labels immediately implies a bound on the output of the hypothesis produced by equation (22). We formally state this lemma below.
Lemma 13 Let $h^*$ be the solution to equation (22), let $c$ be a cost function and let $B(\cdot)$ be a real-valued function such that $\forall y \in \{y \mid \exists x \in X, \exists h \in H, y = h(x)\}, \forall y' \in Y$,

$$c(y, y') \le B(y).$$

Then, the output of $h^*$ is bounded as follows,

$$\forall x \in X, \quad |h^*(x)| \le \kappa \sqrt{\frac{B(0)}{\lambda}},$$

where $\lambda$ is the regularization parameter, and $\kappa^2 \ge K(x, x)$ for all $x \in X$.

Proof Let $F(h) = \frac{1}{m} \sum_{i=1}^{m} c(h, z_i) + \lambda \|h\|_K^2$ and let $0$ be the zero hypothesis, then by definition of $F$ and $h^*$,

$$\lambda \|h^*\|_K^2 \le F(h^*) \le F(0) \le B(0).$$

Then, using the reproducing kernel property and the Cauchy-Schwarz inequality we note,

$$\forall x \in X, \quad |h^*(x)| = |\langle h^*, K(x, \cdot) \rangle| \le \|h^*\|_K \sqrt{K(x, x)} \le \kappa \|h^*\|_K.$$

Combining the two inequalities produces the result. ■



We note that in Bousquet and Elisseeff (2002), the following bound is also stated: $c(h^*(x), y') \le B(\kappa \sqrt{B(0)/\lambda})$. However, when later applied it seems the authors use an incorrect upper bound function $B(\cdot)$, which we remedy in the following.

Corollary 14 Assume a bounded output $Y = [0, B]$, for some $B > 0$, and assume that $K(x, x) \le \kappa^2$ for all $x$ for some $\kappa > 0$. Let $h_S$ denote the hypothesis returned by the algorithm when trained on a sample $S$ drawn from an algebraically $\varphi$-mixing stationary distribution. Let $u = r/(r + 1) \in [\frac{1}{2}, 1]$, $M' = 2(r + 1)\varphi_0 M/(r\varphi_0 M)^u$, and $\varphi_0' = (1 + 2\varphi_0 r/(r - 1))$. Then, with probability at least $1 - \delta$, the following generalization bounds hold for

a. Support Vector Machines (SVM, with hinge-loss)

'2 /^2X«3^^/ / ^2 / 2X« ^/ N /21og(2/5) 



where M = 1. 
b. Support Vector Regression (SVR): 



Am V A / m" \ A \ A / m" ^ / V m 



^ H., X - ,'k.^Y3M' ,/,^ k2 /^2X« \ /21og(2/5) 



where $M = \kappa \sqrt{B/\lambda} + B$.



c. Kernel Ridge Regression (KRR): 



where $M = \kappa^2 B^2/\lambda + B^2$.
Proof For SVM, the hinge-loss is 1-admissible giving $\hat{\beta} \le \kappa^2/(\lambda m)$, and the cost function is clearly bounded by $M = 1$.

Similarly, SVR has a loss function that is 1-admissible, thus, applying Lemma 12 gives us $\hat{\beta} \le \kappa^2/(\lambda m)$. Using Lemma 13, with $B(0) = B$, we can bound the loss as follows: $\forall x \in X, y \in Y$, $|h^*(x) - y| \le \kappa \sqrt{B/\lambda} + B$.

Finally for KRR, we have a loss function that is $2B$-admissible and again using Lemma 12, $\hat{\beta} \le 4\kappa^2 B^2/(\lambda m)$. Again, applying Lemma 13 with $B(0) = B^2$ gives, $\forall x \in X, y \in Y$, $(h^*(x) - y)^2 \le \kappa^2 B^2/\lambda + B^2$.

Plugging these values into the bound of Theorem 11 and setting the right-hand side to $\delta$ yields the statement of the corollary. ■
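The constants derived in this proof can be collected in a small helper. This is a sketch under the assumptions stated in the proof: the stability coefficient is taken to be $\sigma^2\kappa^2/(m\lambda)$ as in Lemma 12, with $\sigma = 1$ for SVM and SVR and $\sigma = 2B$ for KRR. The function names and the string keys are hypothetical.

```python
import math


def kernel_stability(sigma, kappa, lam, m):
    """Stability coefficient sigma^2 * kappa^2 / (m * lam) from Lemma 12."""
    return sigma ** 2 * kappa ** 2 / (m * lam)


def corollary14_constants(algorithm, kappa, lam, m, B=1.0):
    """Return (beta_hat, M): stability coefficient and bound on the cost."""
    if algorithm == "svm":
        # Hinge loss: 1-admissible, cost bounded by 1.
        return kernel_stability(1.0, kappa, lam, m), 1.0
    if algorithm == "svr":
        # Epsilon-insensitive loss: 1-admissible, Lemma 13 with B(0) = B.
        return kernel_stability(1.0, kappa, lam, m), kappa * math.sqrt(B / lam) + B
    if algorithm == "krr":
        # Squared loss: 2B-admissible, Lemma 13 with B(0) = B^2.
        return kernel_stability(2.0 * B, kappa, lam, m), kappa ** 2 * B ** 2 / lam + B ** 2
    raise ValueError("unknown algorithm: " + str(algorithm))
```

In all three cases the stability coefficient decays as $1/m$ for fixed $\lambda$, which is the rate needed by the convergence discussion above.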



3.4.2 Relative Entropy Regularized Algorithms 

In this section we apply Theorem 11 to algorithms that produce a hypothesis $h$ that is a convex combination of base hypotheses $h_\theta \in H$ which are parameterized by $\theta \in \Theta$. Thus, we wish to learn a weighting function $g \in G : \Theta \to \mathbb{R}$ that is a solution to the following optimization,

$$\operatorname*{argmin}_{g \in G} \frac{1}{m} \sum_{i=1}^{m} c(g, z_i) + \lambda D(g\|g_0), \tag{29}$$

where the cost function $c : G \times Z \to \mathbb{R}$ is defined in terms of a second internal cost function $c' : H \times Z \to \mathbb{R}$:

$$c(g, z) = \int_\Theta c'(h_\theta, z)\, g(\theta)\, d\theta,$$

and where $D$ is the Kullback-Leibler divergence or relative entropy regularizer (with respect to some fixed distribution $g_0$):

$$D(g\|g_0) = \int_\Theta g(\theta) \ln \frac{g(\theta)}{g_0(\theta)}\, d\theta.$$

It has been shown (Bousquet and Elisseeff, 2002, Theorem 24) that an algorithm satisfying equation (29) and with bounded loss $c'(\cdot) \le M$, is $\hat{\beta}$-stable with coefficient

$$\hat{\beta} \le \frac{M^2}{\lambda m}.$$

The application of our bounds results in the following corollary.

Corollary 15 Let $h_S$ be the hypothesis produced by the optimization in (29), with internal cost function $c'$ bounded by $M$. Then with probability at least $1 - \delta$,



RiHs) < Rihs) + _ + _ + ^i(M + — + • ' 



m 



where $u = r/(r + 1) \in [\frac{1}{2}, 1]$, $M' = 2(r + 1)\varphi_0 M^{u+1}/(r\varphi_0)^u$, and $\varphi_0' = (1 + 2\varphi_0 r/(r - 1))$.
3.5 Discussion

The results presented here are, to the best of our knowledge, the first stability-based generalization bounds for the class of algorithms just studied in a non-i.i.d. scenario. These bounds are non-trivial when the condition $\lambda \gg 1/m^{1/2 - 1/r}$ on the regularization parameter holds for all large values of $m$. This condition coincides with the i.i.d. condition, in the limit, as $r$ tends to infinity. The next section gives stability-based generalization bounds that hold even in the scenario of $\beta$-mixing sequences.



4. $\beta$-Mixing Generalization Bounds

In this section, we prove a stability-based generalization bound that only requires the training sequence to be drawn from a stationary $\beta$-mixing distribution. The bound is thus more general and covers the $\varphi$-mixing case analyzed in the previous section. However, unlike the $\varphi$-mixing case, the $\beta$-mixing bound presented here is not a purely exponential bound. It contains an additive term, which depends on the mixing coefficient.

As in the previous section, $\Phi(S)$ is defined by $\Phi(S) = R(h_S) - \widehat{R}(h_S)$. To simplify the presentation, here, we will define the generalization error of $h_S$ by $R(h_S) = \mathbb{E}_z[c(h_S, z)]$. Thus, test samples are assumed independent of $S$. By Lemma 5, this can be assumed modulo the additional term $b\hat{\beta} + M\beta(b)$, for a cost function bounded by $M$. Note that for any block of points $Z = z_1 \ldots z_k$ drawn independently of $S$, the following equality

$$\mathbb{E}_Z\left[ \frac{1}{|Z|} \sum_{z \in Z} c(h_S, z) \right] = \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}_{z_i}[c(h_S, z_i)] = \mathbb{E}_z[c(h_S, z)] \tag{30}$$

holds since, by stationarity, $\mathbb{E}_{z_i}[c(h_S, z_i)] = \mathbb{E}_{z_j}[c(h_S, z_j)]$ for all $1 \le i, j \le k$. Thus, $R(h_S) = \mathbb{E}_Z\left[ \frac{1}{|Z|} \sum_{z \in Z} c(h_S, z) \right]$ for any such block $Z$. For convenience, we will extend the cost function $c$ to blocks as follows:

$$c(h, Z) = \frac{1}{|Z|} \sum_{z \in Z} c(h, z). \tag{31}$$

With this notation, $R(h_S) = \mathbb{E}_Z[c(h_S, Z)]$ for any block drawn independently of $S$, regardless of the size of $Z$.

To derive a generalization bound for the $\beta$-mixing scenario, we will apply McDiarmid's inequality to $\Phi$ defined over a sequence of independent blocks. The independent blocks we will be considering are non-symmetric and thus more general than those considered by previous authors (Yu, 1994; Meir, 2000; Lozano et al., 2006).



From a sample $S$ made of a sequence of $m$ points, we construct two sequences of blocks $S_a$ and $S_b$, each containing $\mu$ blocks. Each block in $S_a$ contains $a$ points and each block in $S_b$ contains $b$ points. $S_a$ and $S_b$ form a partitioning of $S$; for any $a, b \in [0, m]$ such that $(a + b)\mu = m$, they are defined precisely as follows:

$$\begin{aligned}
S_a &= (Z_1^{(a)}, \ldots, Z_\mu^{(a)}), \text{ with } Z_i^{(a)} = z_{(i-1)(a+b)+1}, \ldots, z_{(i-1)(a+b)+a} \\
S_b &= (Z_1^{(b)}, \ldots, Z_\mu^{(b)}), \text{ with } Z_i^{(b)} = z_{(i-1)(a+b)+a+1}, \ldots, z_{(i-1)(a+b)+a+b}
\end{aligned} \tag{32}$$

for all $i \in [1, \mu]$. We shall consider similarly sequences of i.i.d. blocks $\widetilde{Z}_i^{(a)}$ and $\widetilde{Z}_i^{(b)}$, $i \in [1, \mu]$, such that the points within each block are drawn according to the same original $\beta$-mixing distribution and shall denote by $\widetilde{S}_a$ the block sequence $(\widetilde{Z}_1^{(a)}, \ldots, \widetilde{Z}_\mu^{(a)})$. In preparation for the application of McDiarmid's inequality, we give a bound on the expectation of $\Phi(\widetilde{S}_a)$. Since the expectation is taken over a sequence of i.i.d. blocks, this brings us to a situation similar to the i.i.d. scenario analyzed by Bousquet and Elisseeff (2002), with the exception that we are dealing with i.i.d. blocks instead of i.i.d. points.
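The partition of a sample into the two block sequences can be sketched in a few lines. The code below assumes the sample is given as a Python list whose length is an exact multiple of $a + b$, and it uses 0-based indexing internally; the function name `split_into_blocks` is illustrative.

```python
def split_into_blocks(sample, a, b):
    """Split a sequence into alternating blocks of a points and b points.

    Returns (S_a, S_b), each a list of mu blocks, where (a + b) * mu = len(sample).
    """
    m = len(sample)
    if a < 0 or b < 0 or a + b == 0 or m % (a + b) != 0:
        raise ValueError("need a, b >= 0 with (a + b) dividing the sample size")
    mu = m // (a + b)
    S_a, S_b = [], []
    for i in range(mu):
        start = i * (a + b)
        S_a.append(sample[start:start + a])          # i-th block of a points
        S_b.append(sample[start + a:start + a + b])  # following block of b points
    return S_a, S_b
```

For instance, with the sample $z_1, \ldots, z_{12}$, $a = 3$ and $b = 1$, the blocks of $S_a$ are separated by single points that go to $S_b$, which is the asymmetric construction described above.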



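Lemma 16 Let $\widetilde{S}_a$ be an independent block sequence as defined above, then the following bound holds for the expectation of $|\Phi(\widetilde{S}_a)|$:

$$\mathbb{E}_{\widetilde{S}_a}[|\Phi(\widetilde{S}_a)|] \le a\hat{\beta}.$$

Proof Since the blocks $\widetilde{Z}^{(a)}$ are independent, we can replace any one of them with any other block $Z$ drawn from the same distribution. However, changing the training set also changes the hypothesis, in a limited way. This is shown precisely below.

$$\begin{aligned}
\mathbb{E}_{\widetilde{S}_a}[|\Phi(\widetilde{S}_a)|] &= \mathbb{E}_{\widetilde{S}_a}\left[ \left| \frac{1}{\mu} \sum_{i=1}^{\mu} c(h_{\widetilde{S}_a}, \widetilde{Z}_i^{(a)}) - \mathbb{E}_Z[c(h_{\widetilde{S}_a}, Z)] \right| \right] \\
&\le \mathbb{E}_{\widetilde{S}_a, Z}\left[ \left| \frac{1}{\mu} \sum_{i=1}^{\mu} c(h_{\widetilde{S}_a}, \widetilde{Z}_i^{(a)}) - c(h_{\widetilde{S}_a}, Z) \right| \right] \\
&= \mathbb{E}_{\widetilde{S}_a, Z}\left[ \left| \frac{1}{\mu} \sum_{i=1}^{\mu} c(h_{\widetilde{S}_a^i}, Z) - c(h_{\widetilde{S}_a}, Z) \right| \right],
\end{aligned}$$

where $\widetilde{S}_a^i$ corresponds to the block sequence $\widetilde{S}_a$ obtained by replacing the $i$th block with $Z$. The inequality holds through the use of Jensen's inequality. The $\hat{\beta}$-stability of the learning algorithm gives

$$\mathbb{E}_{\widetilde{S}_a, Z}\left[ \frac{1}{\mu} \left| \sum_{i=1}^{\mu} c(h_{\widetilde{S}_a^i}, Z) - c(h_{\widetilde{S}_a}, Z) \right| \right] \le \mathbb{E}_{\widetilde{S}_a, Z}\left[ \frac{1}{\mu} \sum_{i=1}^{\mu} a\hat{\beta} \right] \le a\hat{\beta}.$$

■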



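We now relate the non-i.i.d. event $\Pr[\Phi(S) \ge \epsilon]$ to an independent block sequence event to which we can apply McDiarmid's inequality.

Lemma 17 Assume a $\hat{\beta}$-stable algorithm. Then, for a sample $S$ drawn from a stationary $\beta$-mixing distribution, the following bound holds,

$$\Pr_S[|\Phi(S)| \ge \epsilon] \le \Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon_0' \right] + (\mu - 1)\beta(b), \tag{33}$$

where $\epsilon_0' = \epsilon - \frac{\mu b M}{m} - 2\mu b \hat{\beta} - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|]$.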



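Proof The proof consists of first rewriting the event in terms of $S_a$ and $S_b$ and bounding the error on the points in $S_b$ in a trivial manner. This can be afforded since $b$ will be eventually chosen to be small. Since $|\mathbb{E}_{Z'}[c(h_S, Z')] - c(h_S, z')| \le M$ for any $z' \in S_b$, we can write

$$\begin{aligned}
\Pr_S[|\Phi(S)| \ge \epsilon] &= \Pr_S[|R(h_S) - \widehat{R}(h_S)| \ge \epsilon] \\
&= \Pr_S\left[ \frac{1}{m} \left| \sum_{z \in S} \left( \mathbb{E}_Z[c(h_S, Z)] - c(h_S, z) \right) \right| \ge \epsilon \right] \\
&\le \Pr_S\left[ \frac{1}{m} \left| \sum_{z \in S_a} \left( \mathbb{E}_Z[c(h_S, Z)] - c(h_S, z) \right) \right| + \frac{1}{m} \left| \sum_{z' \in S_b} \left( \mathbb{E}_{Z'}[c(h_S, Z')] - c(h_S, z') \right) \right| \ge \epsilon \right] \\
&\le \Pr_S\left[ \frac{1}{m} \left| \sum_{z \in S_a} \left( \mathbb{E}_Z[c(h_S, Z)] - c(h_S, z) \right) \right| + \frac{\mu b M}{m} \ge \epsilon \right].
\end{aligned}$$

By $\hat{\beta}$-stability and $\mu a/m \le 1$, this last term can be bounded as follows

$$\Pr_S\left[ \frac{1}{m} \left| \sum_{z \in S_a} \left( \mathbb{E}_Z[c(h_S, Z)] - c(h_S, z) \right) \right| + \frac{\mu b M}{m} \ge \epsilon \right] \le \Pr_{S_a}\left[ \frac{1}{\mu a} \left| \sum_{z \in S_a} \left( \mathbb{E}_Z[c(h_{S_a}, Z)] - c(h_{S_a}, z) \right) \right| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon \right].$$

The right-hand side can be rewritten in terms of $\Phi$ and bounded in terms of a $\beta$-mixing coefficient:

$$\begin{aligned}
\Pr_{S_a}\left[ \frac{1}{\mu a} \left| \sum_{z \in S_a} \left( \mathbb{E}_Z[c(h_{S_a}, Z)] - c(h_{S_a}, z) \right) \right| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon \right] &= \Pr_{S_a}\left[ |\Phi(S_a)| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon \right] \\
&\le \Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon \right] + (\mu - 1)\beta(b),
\end{aligned}$$

by applying Lemma 3 to the indicator function of the event $\{|\Phi(S_a)| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon\}$. Since $\mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|]$ is a constant, the probability in this last term can be rewritten as

$$\begin{aligned}
\Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon \right] &= \Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] + \frac{\mu b M}{m} + 2\mu b \hat{\beta} \ge \epsilon - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \right] \\
&= \Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon_0' \right],
\end{aligned}$$

which ends the proof of the lemma. ■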



The last two lemmas will help us prove the main result of this section formulated in the following 
theorem. 

Theorem 18 Assume a $\hat{\beta}$-stable algorithm and let $\epsilon'$ denote $\epsilon - \frac{\mu b M}{m} - 2\mu b \hat{\beta} - a\hat{\beta}$ as in Lemma 17. Then, for any sample $S$ of size $m$ drawn according to a stationary $\beta$-mixing distribution, any choice of the parameters $a, b, \mu > 0$ such that $(a + b)\mu = m$, and $\epsilon \ge 0$ such that $\epsilon' \ge 0$, the following generalization bound holds:

$$\Pr_S\left[ |R(h_S) - \widehat{R}(h_S)| \ge \epsilon \right] \le \exp\left( \frac{-2\epsilon'^2 m}{(2a\hat{\beta}m + (a + b)M)^2} \right) + (\mu - 1)\beta(b).$$



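Proof To prove the statement of the theorem, it suffices to bound the probability term appearing in the right-hand side of Equation (33), $\Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon_0' \right]$, which is expressed only in terms of independent blocks. We can therefore apply McDiarmid's inequality by viewing the blocks as i.i.d. "points".

To do so, we must bound the quantity $\left| |\Phi(\widetilde{S}_a)| - |\Phi(\widetilde{S}_a^i)| \right|$ where the sequences $\widetilde{S}_a$ and $\widetilde{S}_a^i$ differ in the $i$th block. We will bound separately the difference between the generalization errors and empirical errors (see footnote 4). The difference in empirical errors can be bounded as follows using the bound on the cost function $c$:

$$\begin{aligned}
|\widehat{R}(h_{\widetilde{S}_a}) - \widehat{R}(h_{\widetilde{S}_a^i})| &\le \frac{1}{\mu} \sum_{j \ne i} \left| c(h_{\widetilde{S}_a}, \widetilde{Z}_j) - c(h_{\widetilde{S}_a^i}, \widetilde{Z}_j) \right| + \frac{1}{\mu} \left| c(h_{\widetilde{S}_a}, \widetilde{Z}_i) - c(h_{\widetilde{S}_a^i}, \widetilde{Z}_i') \right| \\
&\le a\hat{\beta} + \frac{M}{\mu} = a\hat{\beta} + \frac{(a + b)M}{m}.
\end{aligned}$$

The difference in generalization error can be straightforwardly bounded using $\hat{\beta}$-stability:

$$|R(h_{\widetilde{S}_a}) - R(h_{\widetilde{S}_a^i})| = \left| \mathbb{E}_Z[c(h_{\widetilde{S}_a}, Z)] - \mathbb{E}_Z[c(h_{\widetilde{S}_a^i}, Z)] \right| = \left| \mathbb{E}_Z[c(h_{\widetilde{S}_a}, Z) - c(h_{\widetilde{S}_a^i}, Z)] \right| \le a\hat{\beta}.$$

Using these bounds in conjunction with McDiarmid's inequality yields

$$\begin{aligned}
\Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon_0' \right] &\le \exp\left( \frac{-2\epsilon_0'^2 m}{(2a\hat{\beta}m + (a + b)M)^2} \right) \\
&\le \exp\left( \frac{-2\epsilon'^2 m}{(2a\hat{\beta}m + (a + b)M)^2} \right).
\end{aligned}$$

Note that to show the second inequality we make use of Lemma 16 to establish the fact that

$$\epsilon_0' = \epsilon - \frac{\mu b M}{m} - 2\mu b \hat{\beta} - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon - \frac{\mu b M}{m} - 2\mu b \hat{\beta} - a\hat{\beta} = \epsilon'.$$

Finally, we make use of Lemma 17 to establish the proof,

$$\begin{aligned}
\Pr_S[|\Phi(S)| \ge \epsilon] &\le \Pr_{\widetilde{S}_a}\left[ |\Phi(\widetilde{S}_a)| - \mathbb{E}_{\widetilde{S}_a'}[|\Phi(\widetilde{S}_a')|] \ge \epsilon_0' \right] + (\mu - 1)\beta(b) \\
&\le \exp\left( \frac{-2\epsilon'^2 m}{(2a\hat{\beta}m + (a + b)M)^2} \right) + (\mu - 1)\beta(b).
\end{aligned}$$

■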



4. We drop the superscripts on $\widetilde{Z}^{(a)}$ since we will not be considering the sequence $S_b$ in what follows.
In order to make use of the bounds, we must select the values of parameters $b$ and $\mu$ ($a$ is then equal to $m/\mu - b$). There is a trade-off between choosing a large value for $b$, to ensure the mixing term decreases, and choosing a large value of $\mu$, to minimize the remaining terms of the bound. The exact choice of parameters will depend on the type of mixing that is assumed (e.g. algebraic or exponential). In order to choose optimal parameters, it will be useful to view the bound as it holds with high probability, in the following corollary.

Corollary 19 Assume a $\hat{\beta}$-stable algorithm and let $\delta'$ denote $\delta - (\mu - 1)\beta(b)$. Then, for any sample $S$ of size $m$ drawn according to a stationary $\beta$-mixing distribution, any choice of the parameters $a, b, \mu > 0$ such that $(a + b)\mu = m$, and $\delta \ge 0$ such that $\delta' \ge 0$, the following generalization bound holds with probability at least $(1 - \delta)$:

$$|R(h_S) - \widehat{R}(h_S)| < \sqrt{\frac{\log(1/\delta')}{2m}} \left( 2a\hat{\beta}m + (a + b)M \right) + \mu b \left( \frac{M}{m} + 2\hat{\beta} \right) + a\hat{\beta}.$$

In the case of a fast mixing distribution, it is possible to select the values of the parameters to retrieve a bound as in the i.i.d. case, i.e. $|R(h_S) - \widehat{R}(h_S)| \in O\left( m^{-1/2} \sqrt{\log 1/\delta} \right)$. In particular, for $\beta(b) = 0$, we can choose $a = 0$, $b = 1$ and $\mu = m$ to retrieve the i.i.d. bound of Bousquet and Elisseeff (2001).

In the following, we will examine slower mixing algebraic $\beta$-mixing distributions, which are thus not close to the i.i.d. scenario. For algebraic mixing the mixing parameter is defined as $\beta(b) = b^{-r}$. In that case, we wish to minimize the following function in terms of $\mu$ and $b$,

$$s(\mu, b) = \frac{\mu}{b^r} + \frac{m^{3/2}\hat{\beta}}{\mu} + \frac{m^{1/2}}{\mu} + \mu b \left( \frac{1}{m} + \hat{\beta} \right). \tag{34}$$

The first term of the function captures the condition on $\delta > (\mu - 1)\beta(b) \approx \mu/b^r$ and the remaining terms capture the shape of the bound in Corollary 19.

Setting the derivative with respect to each variable $\mu$ and $b$ to zero and solving for each parameter results in the following expressions:

$$b = C_r \gamma^{-\frac{1}{r+1}}, \qquad \mu = \frac{m^{3/4} \gamma^{\frac{1}{2(r+1)}}}{\sqrt{C_r (1 + 1/r)}}, \tag{35}$$

where $\gamma = (m^{-1} + \hat{\beta})$ and $C_r = r^{\frac{1}{r+1}}$ is a constant defined by the parameter $r$.
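The stationarity conditions behind (35) can be worked out in a few lines. The derivation below is a sketch under the assumption that the objective in (34) can be grouped as $s(\mu, b) = \mu b^{-r} + m^{3/2}\gamma/\mu + \mu b \gamma$ with $\gamma = m^{-1} + \hat{\beta}$; constants inside the bound are ignored, as in the asymptotic analysis that follows.

```latex
\begin{align*}
\frac{\partial s}{\partial b} &= -r\mu b^{-r-1} + \mu\gamma = 0
  \;\Longrightarrow\; b^{r+1} = \frac{r}{\gamma}
  \;\Longrightarrow\; b = r^{\frac{1}{r+1}} \gamma^{-\frac{1}{r+1}} = C_r \gamma^{-\frac{1}{r+1}}, \\
\frac{\partial s}{\partial \mu} &= b^{-r} - \frac{m^{3/2}\gamma}{\mu^2} + b\gamma = 0
  \;\Longrightarrow\; \mu^2 = \frac{m^{3/2}\gamma}{b^{-r} + b\gamma}. \\
\intertext{Using $b^{-r} = b\gamma/r$ from the first condition,}
\mu^2 &= \frac{m^{3/2}}{b\,(1 + 1/r)}
  \;\Longrightarrow\; \mu = \frac{m^{3/4}\,\gamma^{\frac{1}{2(r+1)}}}{\sqrt{C_r (1 + 1/r)}}.
\end{align*}
```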

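Now, assuming $\hat{\beta} \in O(m^{-\alpha})$ for some $0 < \alpha \le 1$, we analyze the convergence behavior of Corollary 19. First, we notice that the terms $b$ and $\mu$ have the following asymptotic behavior,

$$b \in O\left( m^{\frac{\alpha}{r+1}} \right), \qquad \mu \in O\left( m^{\frac{3}{4} - \frac{\alpha}{2(r+1)}} \right). \tag{36}$$

Next, we consider the condition $\delta' \ge 0$ which is equivalent to,

$$\delta \ge (\mu - 1)\beta(b) \in O\left( m^{\frac{3}{4} - \alpha\left( 1 - \frac{1}{2(r+1)} \right)} \right). \tag{37}$$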




In order for the right-hand side of the inequality to converge, it must be the case that $\alpha > \frac{3r + 3}{4r + 2}$. In particular, if $\alpha = 1$, as we have shown is the case for several algorithms in Section 3.3, then it suffices that $r > 1$.

Finally, in order to see how the bound itself converges, we study the asymptotic behavior of the terms of Equation (34) (without the first term, which corresponds to the quantity already analyzed in Equation (37)):

$$\frac{m^{3/2}\hat{\beta}}{\mu} + \mu b \hat{\beta} + \frac{m^{1/2}}{\mu} + \frac{\mu b}{m} \in O\Big( \underbrace{m^{\frac{3}{4} - \alpha\left( 1 - \frac{1}{2(r+1)} \right)}}_{(a)} + \underbrace{m^{\frac{\alpha}{2(r+1)} - \frac{1}{4}}}_{(b)} \Big). \tag{38}$$

This expression can be further simplified by noticing that $(b) \le (a)$ for all $0 < \alpha \le 1$ (with equality at $\alpha = 1$). Thus, both the bound and the condition on $\delta$ decrease asymptotically as the term in $(a)$, resulting in the following corollary.



Corollary 20 Assume a $\hat{\beta}$-stable algorithm with $\hat{\beta} \in O(m^{-1})$ and let $\delta' = \delta - m^{\frac{1}{2(r+1)} - \frac{1}{4}}$. Then, for any sample $S$ of size $m$ drawn according to a stationary algebraic $\beta$-mixing distribution, and $\delta \ge 0$ such that $\delta' \ge 0$, the following generalization bound holds with probability at least $(1 - \delta)$:

$$|R(h_S) - \widehat{R}(h_S)| < O\left( m^{\frac{1}{2(r+1)} - \frac{1}{4}} \sqrt{\log(1/\delta')} \right). \tag{39}$$

As in previous bounds $r > 1$ is required for convergence. Furthermore, as expected, a larger mixing parameter $r$ leads to a more favorable bound.
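The dependence of the rate on $r$ is easy to tabulate. The sketch below assumes that the bound of Corollary 20 scales as $m^{e(r)}$ with $e(r) = \frac{1}{2(r+1)} - \frac{1}{4}$, and that the condition on the stability rate is $\alpha > \frac{3r+3}{4r+2}$; both function names are illustrative.

```python
def rate_exponent(r):
    """Exponent e(r) such that the bound scales as m**e(r); negative iff r > 1."""
    return 1.0 / (2.0 * (r + 1.0)) - 0.25


def alpha_threshold(r):
    """Stability rate alpha must exceed this value for the condition on delta to converge."""
    return (3.0 * r + 3.0) / (4.0 * r + 2.0)
```

At $r = 1$ the exponent is exactly zero, so the bound does not decrease with $m$; as $r$ grows the exponent approaches $-1/4$ and the threshold on $\alpha$ approaches $3/4$.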



5. Conclusion 

We presented stability bounds for both $\varphi$-mixing and $\beta$-mixing stationary sequences. Our bounds
apply to large classes of algorithms, including common algorithms such as SVR, KRR, and SVMs, 
and extend to non-i.i.d. scenarios existing i.i.d. stability bounds. Since they are algorithm-specific, 
these bounds can often be tighter than other generalization bounds based on general complexity 
measures for families of hypotheses. As in the i.i.d. case, weaker notions of stability might help 
further improve and refine these bounds. 

Our bounds can be used to analyze the properties of stable algorithms when used in the non-i.i.d. settings studied. But, more importantly, they can serve as a tool for the design of novel and accurate learning algorithms. Of course, some mixing properties of the distributions need to be known to take advantage of the information supplied by our generalization bounds. In some problems, it is possible to estimate the shape of the mixing coefficients. This should help in devising such algorithms.